Whole Genome Phylogeny via Complete Composition Vectors
Authors
∗Bioinformatics Research Group, Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada. Emails: xiaomeng, wgang, [email protected].
†Digital Biology Laboratory, Department of Computer Science, University of Missouri – Columbia, Columbia, Missouri 65211, USA. Emails: wanx, [email protected].
‡To whom correspondence should be addressed. Fax: (780) 492-1071. Email: [email protected].
Abstract
The availability of complete genomic sequences allows us to infer the evolutionary footprints between species using a global strategy. However, the length of these sequences poses a challenge to both computational efficiency and the optimal representation of information in phylogenetic analyses. This paper describes a new method, the complete composition vector (CCV), for inferring evolutionary relationships between species from their complete genomic sequences. In this method, the character string frequencies in the complete genomic sequence of each species are represented by a complete composition vector in a high-dimensional space. After the random mutation background is filtered out, the cosines of the angles between the representing vectors are converted into pairwise evolutionary distances, from which the phylogenetic tree is constructed using the neighbor-joining algorithm. The method bypasses the complexity of performing multiple sequence alignments and avoids the ambiguity of choosing individual genes, while it is expected to effectively retain the rich evolutionary information contained in the whole genomic sequence. To verify its strengths, the method was applied to infer the evolutionary footprints of coronaviruses and microbes. On a typical desktop PC, it took only one and a half days to construct the phylogeny for 109 species, comprising 103 microbes and 6 eukaryotes. The phylogenetic trees generated by our method are highly consistent with those annotated by biologists.
Primary Keyphrases: Phylogenetic analysis, Genome evolution, Genome comparison, Comparative genomics, Computational genetics
Secondary Keyphrases: Phylogenetics: algorithms, Phylogenetics: statistical aspects
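The pipeline outlined in the abstract (string frequencies, background filtering, cosine of the angle, pairwise distance) can be illustrated with a minimal Python sketch. Several details here are assumptions, since the abstract does not state them: the background is estimated with a Markov-style prediction from shorter substrings, as is common in composition-vector methods; the "complete" vector is taken to be the concatenation of vectors for a range of string lengths (`k_min` to `k_max`, hypothetical parameters); the cosine `C` is mapped to a distance as `(1 - C) / 2`; and strings that never occur in a sequence are simply left out. All function names are illustrative.

```python
from collections import Counter
from math import sqrt


def kmer_freqs(seq, k):
    """Relative frequency of every length-k substring that occurs in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {s: c / total for s, c in counts.items()}


def composition_vector(seq, k):
    """Background-corrected vector for strings of length k (k >= 3).

    Assumption: the expected frequency p0 of a string is predicted from
    its two (k-1)-substrings and its middle (k-2)-substring.
    """
    p_k = kmer_freqs(seq, k)
    p_k1 = kmer_freqs(seq, k - 1)
    p_k2 = kmer_freqs(seq, k - 2)
    vec = {}
    for s, p in p_k.items():
        p0 = p_k1[s[:-1]] * p_k1[s[1:]] / p_k2[s[1:-1]]
        vec[(k, s)] = (p - p0) / p0  # relative deviation from background
    return vec


def complete_composition_vector(seq, k_min=3, k_max=5):
    """Concatenate the vectors for every k in [k_min, k_max] (assumed range)."""
    vec = {}
    for k in range(k_min, k_max + 1):
        vec.update(composition_vector(seq, k))
    return vec


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(x * v[key] for key, x in u.items() if key in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def ccv_distance(seq_a, seq_b, k_min=3, k_max=5):
    """Pairwise distance in [0, 1]; assumed mapping D = (1 - C) / 2."""
    c = cosine(complete_composition_vector(seq_a, k_min, k_max),
               complete_composition_vector(seq_b, k_min, k_max))
    return (1.0 - c) / 2.0
```

Calling `ccv_distance` on every pair of genomes yields a distance matrix, which would then be passed to any neighbor-joining implementation to build the tree. No alignment step appears anywhere in the sketch, which is the point the abstract makes about bypassing multiple sequence alignment.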
Similar resources
CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes
The CVTree web server (http://tlife.fudan.edu.cn/cvtree) presented here is a new implementation of the whole genome-based, alignment-free composition vector (CV) method for phylogenetic analysis. It is more efficient and user-friendly than the previously published version in the 2004 web server issue of Nucleic Acids Research. The development of whole genome-based alignment-free CV method has p...
A Whole Genome Phylogeny Using Truncated Pivoted QR Decomposition
The increasing availability of whole genome sequences in public databases has stimulated the development of new methods to automatically compare and categorize genes and species. Recently developed methods based on the singular value decomposition (SVD) allow for the simultaneous identification and definition of well conserved motifs and gene families using very large whole genome datasets. In ...
Phylogeny Based on Whole Genome as inferred from Complete Information Set Analysis.
Previous molecular phylogeny algorithms mainly rely on multi-sequence alignments of cautiously selected characteristic sequences, thus not directly appropriate for whole genome phylogeny where events such as rearrangements make full-length alignments impossible. We introduce here the concept of Complete Information Set (CIS) and its measurement implementation as evolution distance without reference ...
Whole-genome phylogeny of mammals: evolutionary information in genic and nongenic regions.
Ten complete mammalian genome sequences were compared by using the "feature frequency profile" (FFP) method of alignment-free comparison. This comparison technique reveals that the whole nongenic portion of mammalian genomes contains evolutionary information that is similar to their genic counterparts--the intron and exon regions. We partitioned the complete genomes of mammals (such as human, c...
Publication date: 2004